Hello there!

2+2 
[1] 4
4+5
[1] 9

insert R code chunk: ctrl + alt + i

data("present")
View(present)

Shortcuts

piping operator: ctrl + shift + m

assign operator: alt + -

Assignment

library(dplyr)
library(ggplot2)
library(statsr)
data(arbuthnot)
dim(arbuthnot)
[1] 82  3
names(arbuthnot)
[1] "year"  "boys"  "girls"
arbuthnot$boys
 [1] 5218 4858 4422 4994 5158 5035 5106 4917 4703 5359 5366 5518
[13] 5470 5460 4793 4107 4047 3768 3796 3363 3079 2890 3231 3220
[25] 3196 3441 3655 3668 3396 3157 3209 3724 4748 5216 5411 6041
[37] 5114 4678 5616 6073 6506 6278 6449 6443 6073 6113 6058 6552
[49] 6423 6568 6247 6548 6822 6909 7577 7575 7484 7575 7737 7487
[61] 7604 7909 7662 7602 7676 6985 7263 7632 8062 8426 7911 7578
[73] 8102 8031 7765 6113 8366 7952 8379 8239 7840 7640
ggplot(data = arbuthnot, aes(x = year, y = girls)) +
  geom_point()

arbuthnot <- arbuthnot %>%
  mutate(total = boys + girls)
ggplot(data = arbuthnot, aes(x = year, y = total)) +
  geom_line()

ggplot(data = arbuthnot, aes(x = year, y = total)) +
  geom_line() +
  geom_point()

arbuthnot <- arbuthnot %>%
  mutate(more_boys = boys > girls)
data(present)
dim(present)
[1] 74  3

Calculate the total number of births for each year and store these values in a new variable called total in the present dataset. Then, calculate the proportion of boys born each year and store these values in a new variable called prop_boys in the same dataset. Plot these values over time and based on the plot determine if the following statement is true or false: The proportion of boys born in the US has decreased over time.

True

False

present <- present %>%
  mutate(total = boys + girls) %>%
  mutate(prop_boys = boys / total)

ggplot(data = present, aes(x = year, y = prop_boys)) + geom_line()

Create a new variable called more_boys which contains the value of either TRUE if that year had more boys than girls, or FALSE if that year did not. Based on this variable which of the following statements is true?

Every year there are more girls born than boys.

Every year there are more boys born than girls.

Half of the years there are more boys born, and the other half more girls born.

present <- present %>%
  mutate(more_boys = boys > girls)

ggplot(data = present, aes(x = year, y = more_boys)) + geom_line()

Calculate the boy-to-girl ratio each year, and store these values in a new variable called prop_boy_girl in the present dataset. Plot these values over time. Which of the following best describes the trend?

There appears to be no trend in the boy-to-girl ratio from 1940 to 2013.

There is initially an increase in boy-to-girl ratio, which peaks around 1960. After 1960 there is a decrease in the boy-to-girl ratio, but the number begins to increase in the mid 1970s.

There is initially a decrease in the boy-to-girl ratio, and then an increase between 1960 and 1970, followed by a decrease.

The boy-to-girl ratio has increased over time.

There is an initial decrease in the boy-to-girl ratio born but this number appears to level around 1960 and remain constant since then.

present <- present %>%
  mutate(prop_boy_girl = boys / girls)

ggplot(data = present, aes(x = year, y = prop_boy_girl)) + geom_line()
names(nycflights)
 [1] "year"      "month"     "day"      
 [4] "dep_time"  "dep_delay" "arr_time" 
 [7] "arr_delay" "carrier"   "tailnum"  
[10] "flight"    "origin"    "dest"     
[13] "air_time"  "distance"  "hour"     
[16] "minute"   
view(nycflights)
Error in view(nycflights) : could not find function "view"
str(nycflights)
Classes ‘tbl_df’ and 'data.frame':  32735 obs. of  16 variables:
 $ year     : int  2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
 $ month    : int  6 5 12 5 7 1 12 8 9 4 ...
 $ day      : int  30 7 8 14 21 1 9 13 26 30 ...
 $ dep_time : int  940 1657 859 1841 1102 1817 1259 1920 725 1323 ...
 $ dep_delay: num  15 -3 -1 -4 -3 -3 14 85 -10 62 ...
 $ arr_time : int  1216 2104 1238 2122 1230 2008 1617 2032 1027 1549 ...
 $ arr_delay: num  -4 10 11 -34 -8 3 22 71 -8 60 ...
 $ carrier  : chr  "VX" "DL" "DL" "DL" ...
 $ tailnum  : chr  "N626VA" "N3760C" "N712TW" "N914DL" ...
 $ flight   : int  407 329 422 2391 3652 353 1428 1407 2279 4162 ...
 $ origin   : chr  "JFK" "JFK" "JFK" "JFK" ...
 $ dest     : chr  "LAX" "SJU" "LAX" "TPA" ...
 $ air_time : num  313 216 376 135 50 138 240 48 148 110 ...
 $ distance : num  2475 1598 2475 1005 296 ...
 $ hour     : num  9 16 8 18 11 18 12 19 7 13 ...
 $ minute   : num  40 57 59 41 2 17 59 20 25 23 ...

rdu_flights <- nycflights %>%
  filter(dest == "RDU")

ggplot(data = rdu_flights, aes(x = dep_delay)) +
  geom_histogram()

rdu_flights %>%
  summarise(mean_dd = mean(dep_delay), sd_dd = sd(dep_delay), n = n())
  1. Create a new data frame that includes flights headed to SFO in February, and save this data frame as sfo_feb_flights. How many flights meet these criteria?
    1. 68
    2. 1345
    3. 2286
    4. 3563
    5. 32735
sfo_feb_flights <- nycflights %>%
  filter(dest == "SFO", month == 2)

sfo_feb_flights %>%
  summarise(n = n())
  1. Make a histogram and calculate appropriate summary statistics for arrival delays of sfo_feb_flights. Which of the following is false?
    1. The distribution is unimodal.
    2. The distribution is right skewed.
    3. No flight is delayed more than 2 hours.
    4. The distribution has several extreme values on the right side.
    5. More than 50% of flights arrive on time or earlier than scheduled.
ggplot(data = sfo_feb_flights, aes(x = arr_delay)) +
  geom_histogram(binwidth = 10)

nycflights <- nycflights %>%
  mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed"))

nycflights %>%
  group_by(origin) %>%
  summarise(ot_dep_rate = sum(dep_type == "on time") / n()) %>%   
  arrange(desc(ot_dep_rate))

---
title: "Introduction to probability and data"
author: "abhimanyu nath"
output: html_notebook
---

**Hello there!**


```{r}
2+2 
4+5
```
insert R code chunk: ctrl + alt + i

![](images/1.png)

![](images/2.png)
![](images/3.png)
![](images/4.png)
![](images/5.png)
![](images/6.png)
![](images/7.png)
![](images/8.png)
![](images/9.png)
![](images/10.png)
![](images/11.png)

```{r}
# The present dataset ships with statsr, so load the package first
library(statsr)
data("present")
View(present)
```

**Shortcuts**

piping operator: ctrl + shift + m

assign operator: alt + -
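To make the two operators concrete, here is a minimal sketch of what they do. It assumes dplyr is installed (it provides `%>%`); the vector `x` and the names `nested` and `piped` are made up for the illustration.

```r
library(dplyr)  # makes the %>% pipe available

# <- assigns the value on the right to the name on the left
x <- c(1, 4, 9)

# Nested call: read from the inside out
nested <- sum(sqrt(x))

# Piped call: read from left to right; x %>% f() is the same as f(x)
piped <- x %>%
  sqrt() %>%
  sum()

identical(nested, piped)
```

Both forms compute the same value; the piped form is mainly easier to read when several steps are chained.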

# Assignment

```{r}
library(dplyr)
library(ggplot2)
library(statsr)
```


```{r}
# Load the arbuthnot data and inspect its dimensions and column names
data(arbuthnot)
dim(arbuthnot)
names(arbuthnot)
arbuthnot$boys

# Number of girls per year
ggplot(data = arbuthnot, aes(x = year, y = girls)) +
  geom_point()

# Add the yearly total and plot it over time
arbuthnot <- arbuthnot %>%
  mutate(total = boys + girls)
ggplot(data = arbuthnot, aes(x = year, y = total)) +
  geom_line()
ggplot(data = arbuthnot, aes(x = year, y = total)) +
  geom_line() +
  geom_point()

# Flag the years in which boys outnumber girls
arbuthnot <- arbuthnot %>%
  mutate(more_boys = boys > girls)
```


```{r}
data(present)
dim(present)
```
Calculate the total number of births for each year and store these values in a new variable called total in the  present dataset. Then, calculate the proportion of boys born each year and store these values in a new variable called prop_boys in the same dataset. Plot these values over time and based on the plot determine if the following statement is true or false: The proportion of boys born in the US has decreased over time.

True

False
```{r}
present <- present %>%
  mutate(total = boys + girls) %>%
  mutate(prop_boys = boys / total)

ggplot(data = present, aes(x = year, y = prop_boys)) + geom_line()

```

Create a new variable called more_boys which contains the value of either TRUE if that year had more boys than girls, or FALSE if that year did not. Based on this variable which of the following statements is true?

Every year there are more girls born than boys.

Every year there are more boys born than girls.

Half of the years there are more boys born, and the other half more girls born.

```{r}
present <- present %>%
  mutate(more_boys = boys > girls)

ggplot(data = present, aes(x = year, y = more_boys)) + geom_line()
```
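A line plot of a TRUE/FALSE variable can be hard to read, so one alternative is to tabulate it. The sketch below is only an illustration: `toy_births` and its counts are made up (they are not values from `present`), and it assumes dplyr is installed.

```r
library(dplyr)

# Made-up counts for illustration only; not taken from the present data
toy_births <- data.frame(
  year  = c(2001, 2002, 2003, 2004),
  boys  = c(520, 498, 515, 530),
  girls = c(500, 501, 490, 510)
)

# Count how many years fall into each value of more_boys
more_boys_counts <- toy_births %>%
  mutate(more_boys = boys > girls) %>%
  count(more_boys)

more_boys_counts
```

Applied to `present`, the same `count(more_boys)` step would show directly whether every year, no year, or only some years had more boys than girls.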

Calculate the boy-to-girl ratio each year, and store these values in a new variable called prop_boy_girl in the  present dataset. Plot these values over time. Which of the following best describes the trend?

There appears to be no trend in the boy-to-girl ratio from 1940 to 2013.

There is initially an increase in boy-to-girl ratio, which peaks around 1960. After 1960 there is a decrease in the boy-to-girl ratio, but the number begins to increase in the mid 1970s.

There is initially a decrease in the boy-to-girl ratio, and then an increase between 1960 and 1970, followed by a decrease.

The boy-to-girl ratio has increased over time.

There is an initial decrease in the boy-to-girl ratio born but this number appears to level around 1960 and remain constant since then.

```{r}
present <- present %>%
  mutate(prop_boy_girl = boys / girls)

ggplot(data = present, aes(x = year, y = prop_boy_girl)) + geom_line()
```
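The boy-to-girl ratio and the proportion of boys describe the same thing on different scales, which may help when comparing the two plots. A small check with made-up counts (the numbers below are illustrative, not from `present`):

```r
boys  <- 520  # made-up counts, illustration only
girls <- 500

prop_boys     <- boys / (boys + girls)
prop_boy_girl <- boys / girls

# prop_boys can be recovered from the ratio: ratio / (1 + ratio)
recovered <- prop_boy_girl / (1 + prop_boy_girl)
all.equal(prop_boys, recovered)
```

A ratio above 1 therefore corresponds to a proportion above 0.5, and the two series rise and fall together.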

```{r}
data("nycflights")
```


```{r}
names(nycflights)
```

```{r}
View(nycflights)
```
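The console transcript earlier shows `view(nycflights)` failing with "could not find function", while `View(nycflights)` works. R matches function names exactly, including capitalisation. A minimal check, assuming a fresh session in which no attached package defines a lowercase `view` (some packages, such as tibble, do):

```r
# View() comes from the utils package, which R attaches by default
has_View <- exists("View", mode = "function")

# Lowercase view is a different name; it exists only if some attached
# package happens to define it
has_view <- exists("view", mode = "function")

c(View = has_View, view = has_view)
```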

```{r}
?nycflights
```

```{r}
str(nycflights)
```

```{r}
ggplot(data = nycflights, aes(x = dep_delay)) + geom_histogram()
```


```{r}
ggplot(data = nycflights, aes(x = dep_delay)) + geom_histogram(binwidth = 150)
```



```{r}
rdu_flights <- nycflights %>%
  filter(dest == "RDU")

ggplot(data = rdu_flights, aes(x = dep_delay)) +
  geom_histogram()
```


```{r}
rdu_flights %>%
  summarise(mean_dd = mean(dep_delay), sd_dd = sd(dep_delay), n = n())
```
1. Create a new data frame that includes flights headed to SFO in February, and save 
this data frame as `sfo_feb_flights`. How many flights meet these criteria? 
<ol>
<li> 68 </li> 
<li> 1345 </li> 
<li> 2286 </li> 
<li> 3563 </li>
<li> 32735 </li>
</ol>
```{r}
sfo_feb_flights <- nycflights %>%
  filter(dest == "SFO", month == 2)

sfo_feb_flights %>%
  summarise(n = n())
```
2. Make a histogram and calculate appropriate summary statistics for **arrival** 
delays of `sfo_feb_flights`. Which of the following is false? 
<ol>
<li> The distribution is unimodal. </li> 
<li> The distribution is right skewed. </li> 
<li> No flight is delayed more than 2 hours. </li> 
<li> The distribution has several extreme values on the right side. </li>
<li> More than 50% of flights arrive on time or earlier than scheduled. </li>
</ol>

```{r}
sfo_feb_flights
```



```{r}
ggplot(data = sfo_feb_flights, aes(x = arr_delay)) +
  geom_histogram(binwidth = 10)
```
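The question also asks for summary statistics. One possible set is sketched below on made-up delays; `toy_flights` and its values are hypothetical, and the choice of statistics is an assumption about what is useful for a skewed distribution rather than the lab's prescribed answer. Swapping in `sfo_feb_flights` would give the real figures.

```r
library(dplyr)

# Made-up arrival delays in minutes (negative = early); illustration only
toy_flights <- data.frame(arr_delay = c(-20, -12, -5, 0, 3, 15, 40, 130))

arr_summary <- toy_flights %>%
  summarise(
    median_ad   = median(arr_delay),
    iqr_ad      = IQR(arr_delay),
    max_ad      = max(arr_delay),
    prop_ontime = mean(arr_delay <= 0),  # share arriving on time or early
    n           = n()
  )

arr_summary
```

The maximum speaks to the "no flight is delayed more than 2 hours" statement (120 minutes), and `prop_ontime` speaks to the "more than 50%" statement.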



```{r}
sfo_feb_flights %>%
  group_by(carrier) %>%
  summarise(iqr_ad = IQR(arr_delay), median_ad = median(arr_delay), n = n())
```


```{r}
nycflights %>%
  group_by(month) %>%
  summarise(mean_dd = mean(dep_delay), n = n()) %>%
  arrange(desc(mean_dd))
```


```{r}
nycflights %>%
  group_by(month) %>%
  summarise(median_dd = median(dep_delay), n = n()) %>%
  arrange(desc(median_dd))
```


```{r}
ggplot(nycflights, aes(x = factor(month), y = dep_delay)) +
  geom_boxplot()
```


```{r}
nycflights <- nycflights %>%
  mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed"))

nycflights %>%
  group_by(origin) %>%
  summarise(ot_dep_rate = sum(dep_type == "on time") / n()) %>%   
  arrange(desc(ot_dep_rate))
```
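The rate above is a count of matching rows divided by the group size. Because R treats `TRUE` as 1 and `FALSE` as 0, `mean()` of the same condition gives the same number. A sketch on made-up rows (the data frame `toy` and its values are hypothetical; dplyr is assumed to be installed):

```r
library(dplyr)

# Made-up rows for illustration only
toy <- data.frame(
  origin    = c("EWR", "EWR", "JFK", "JFK", "JFK", "LGA"),
  dep_delay = c(-2, 30, 0, 4, 12, 1)
)

rates <- toy %>%
  mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed")) %>%
  group_by(origin) %>%
  summarise(
    rate_sum  = sum(dep_type == "on time") / n(),  # count / group size
    rate_mean = mean(dep_type == "on time")        # same value via mean()
  )

rates
```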

```{r}
ggplot(data = nycflights, aes(x = origin, fill = dep_type)) +
  geom_bar()
```

```{r}
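# Assuming distance is in miles and air_time in minutes,
# avg_speed here is in miles per minute
nycflights <- nycflights %>%
  mutate(avg_speed = distance / air_time) %>%
  arrange(desc(avg_speed))
select(nycflights, avg_speed, tailnum)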
```
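If the speed is wanted in miles per hour, the air time has to be converted to hours first. The sketch below uses the first row of the `str(nycflights)` output (a JFK to LAX flight) and assumes `distance` is in miles and `air_time` in minutes; the variable names are made up for the example.

```r
# First row of the str(nycflights) output
distance <- 2475  # assumed to be miles
air_time <- 313   # assumed to be minutes

miles_per_minute <- distance / air_time
miles_per_hour   <- distance / (air_time / 60)

c(per_minute = miles_per_minute, per_hour = miles_per_hour)
```

Under those assumptions the flight averages roughly 474 mph, which is plausible for a commercial jet. In the pipeline, the equivalent change would be `mutate(avg_speed = distance / (air_time / 60))`.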


```{r}
nycflights
```


```{r}
ggplot(data = nycflights, aes(y = avg_speed, x = distance)) + geom_point()
```
```{r}
nycflights <- nycflights %>%
  mutate(arr_type = ifelse(arr_delay <= 0, "on time", "delayed"))

nycflights %>%
  group_by(dep_type) %>%
  summarise(ot_arr_rate = sum(arr_type == "on time") / n())
```